At this point in the course, we have introduced you to:
filter: extract rowsselect: extract columnsarrange: sort the rows of a data.frame based on entries from one or more columns (optionally with the combination of desc())mutate: add / change columnsDGEList) we use to store the assay, sample, and gene-level metadataYou have also been playng with the R package we have made for this course and some of the helper functions we provide. I will be using those here.
We have quantitated and packaged up all of the RNA-seq data for you so that you can retrieve it easily using the mbl_load_rnaseq data. This function returns a DGEList of gene expression data for the organism you are asking for.
By default it will return a data object that has data from both the “may” data generated before you got here, and the “mbl” data you generated here.
Last night I showed you how to run PCA on the “may” and “mbl” from mouse separately. This code block shows you how to run it on all the data together.
First load the data:
library(mbl2018)
library(dplyr)
library(ggplot2)
library(mbl2018)
ym.all <- mbl_load_rnaseq("mouse", dataset = "all")
ym.mbl <- mbl_load_rnaseq("mouse", dataset = "mbl")
ym.may <- mbl_load_rnaseq("mouse", dataset = "may")
ym.all <- mbl_load_rnaseq("mouse", dataset = "all")
ym.all <- calcNormFactors(ym.all)
The "dataset" column in the ym.all$samples table tells you which dataset the data came from:
table(ym.all$samples$dataset)
#>
#> may mbl
#> 25 70
You may want to use a subset of samples, perhaps just the “may” samples, or “knockout” samples. You can extract subsets of samples from this dgelist by testing values that are stored in the $samples columns.
For instance, you might just want the “dataset == "may" samples, or the”genotype == "knockout" samples. You can retrieve those like so:
y.may <- ym.all[, ym.all$samples$dataset == "may"] # 25 samples
y.ko <- ym.all[, ym.all$samples$genotype == "knockout"] # 34 samples
The mbl_pca function is a helper function to run PCA on a dataset and return data helpful to you.
Let’s run it on all data:
pca.all <- mbl_pca(ym.all)
The pca.all$data slot has a data.frame with the coordinates of our samples on each of the prinipcal components (“PC1”, “PC2”, etc.), and we have added the sample-level phenotypic data for these (like “source”) so you can use it for plotting.
Let’s take a loook at all the data together. We will color the data by source and shape it by the “dataset” column:
gg.all <- ggplot(pca.all$data, aes(PC1, PC2, color = source, shape = dataset)) +
geom_point()
gg.all
Recall that we can make ggplots interactive by using the plotly package and calling ggplotly on the plot object. We can use the text aesthetic to customize the elements in the hover/tooltips for each point.
For instance, you can include the sample_id into the hover tooltip to see where he outlier samples you have identified are.
library(plotly)
gg.all <- ggplot(pca.all$data,
aes(PC1, PC2, color = source,
text = paste("dataset: ", dataset, "<br>",
"sample_id: ", sample_id))) +
geom_point()
ggplotly(gg.all)
Please refer to the inst/tutorials/dge-mouse-ko.R script for a lot more exposition on what these steps mean, but I will just write the minimal DGE workflow.
First prepare data for analysis
Note that we were using “voom” before, but now using “voomWithQualityWeights”.
# create design matrix
design <- model.matrix(~ 0 + group, ym.all$samples)
# filter out lowly expressed genes
ymf <- ym.all[filterByExpr(ym.all, design),]
# run voomWithQualityWeights to prep data for linear models. I want to color
# bars by group:
cols <- mbl_create_color_map(ymf$samples$group) # This was not covered during
cols <- cols[ymf$samples$group] # the tutorial
vm <- voomWithQualityWeights(ymf, design, col = cols, plot = TRUE)
pcaw <- mbl_pca(vm)
ggplot(pcaw$data, aes(PC1, PC2, color = source, size = sample.weights)) +
geom_point()
Define the questions you want to ask
Here I setup a linear model to ask two questions:
Note that you have to “phrase your questions” using the terms found in the column names of your design matrix (see colnames(vm$design))
# Phrase the questions
cm <- makeContrasts(
cheek_KO = groupcheek_knockout - groupcheek_wildtype,
age = groupcheek_wildtype_Old - groupcheek_wildtype_Young,
tgko = grouptrigeminal_knockout - grouptrigeminal_wildtype,
levels = vm$design)
# Construct linear model to ask the questions
fit <- lmFit(vm, vm$design)
fit2 <- contrasts.fit(fit, cm)
# Do something about the variance
fit2 <- eBayes(fit2)
Get results for the “Cheek Knockout” Question
res.ko <- topTable(fit2, "cheek_KO", n = Inf)
# View(res.ko) # Take a peak
Get results for the “age” Question
res.age <- topTable(fit2, "age", n = Inf)
# View(res.age) # Take a peak
Did you find an interesting gene? We provided an mbl_plot_expression function to plot individual genes.
One of the top hits in the cheek_KO comparison is “Krtap20-2”. You can plot it like so:
mbl_plot_expression(vm, "Krtap20-2", "group")
There are a few tweaks you might want to make to this plot.
Maybe you just want to plot the expression of this gene in the samples whose group we tested, ie. just in “cheek_knockout” and “cheek_wildtype”. You can use the values in the vm$targets$group colum to select just those samples and plot:
mbl_plot_expression(
vm[, vm$targets$group %in% c("cheek_knockout", "cheek_wildtype")],
"Krtap20-2", "group")
You might be wondering what the third parameter is for (“group”) it tells mbl_plot_expression how to split the expression of this gene across the x-axis. The value you put there must be the name of one of the columns of vm$targets
What if you wanted to plot the expression of this gene, split by “source”
mbl_plot_expression(vm, "Krtap20-2", "source")
If this isn’t enough plotting for you, we provide the mbl_tidy function to take your expression expression dataset and turn it into a data.frame that you can use ggplot with.
vmtbl <- mbl_tidy(vm)
Take a peak at vmtbl to see what it is. Every gene expression observation is in one row of the data.frame, which means there are nrow x ncol rows in this dataset. Each row also has the gene information and the sample information.
The log2 normalized expression of this gene is in the “cpm” column
What we will do here is subset all the data down to those rows that are just from for this gene.
dat <- filter(vmtbl, symbol == "Krtap20-2")
Now we can plot with normal ggplot:
ggplot(dat, aes(x = group, y = cpm)) +
geom_boxplot() +
geom_point() +
theme(axis.text.x=element_text(angle=90, hjust=1)) # I am rotating the x-axis labels
Or you can filter down the data to just the symbol and the You can use ggplot your own way.
dat <- filter(vmtbl,
symbol == "Krtap20-2",
group %in% c("cheek_knockout", "cheek_wildtype"))
ggplot(dat, aes(x = group, y = cpm)) +
geom_boxplot() +
geom_point() +
theme(axis.text.x=element_text(angle=90, hjust=1)) # I am rotating the x-axis labels
You can use the coolmap function to draw heatmaps using your vm object.
The rownames() of the vm object are ensembl gene identifiers, so we need to use the ids of the genes we are interested in to draw the heatmap.
We will:
res.ko result tablevm object and pass that into coolmaplibrary(gplots)
#>
#> Attaching package: 'gplots'
#> The following object is masked from 'package:stats':
#>
#> lowess
genez <- head(res.ko$ens_gene, 20)
coolmap(vm[genez,], mar = c(5,10))
Just like we did in the mbl_plot_expression function, we can select out just the samples we are interested in:
coolmap(vm[genez, vm$targets$group %in% c("cheek_knockout", "cheek_wildtype")],
mar = c(5,10))